
1 Introduction

1.1 Literature Review

1.2 Problem statement

QUESTION: For passengers who pay the tip by credit card, what factors influence the tip amount? Should you tip the taxi driver? If yes, how much?

Analyse the factors that determine the tip amount paid to the driver.

Why? Benefits? Use?

Hypothesis: Factors that can affect tip amount:
  1. Vendor service
  2. Driver service
  3. Location (generous locations)
  4. Time of day
  5. Distance
  6. Number of passengers
  7. Weather

2 Data preprocessing

2.1 Load data

write where data is collected from

## Observations: 20,000
## Variables: 19
## $ X                     <int> 4524218, 6458048, 3369795, 18532, 1743670,…
## $ DOLocationID          <int> 211, 249, 161, 4, 107, 246, 237, 125, 142,…
## $ PULocationID          <int> 90, 125, 68, 87, 234, 230, 163, 249, 236, …
## $ VendorID              <int> 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, …
## $ tpep_pickup_datetime  <fct> 6/16/19 0:15, 6/28/19 0:09, 6/14/19 23:04,…
## $ tpep_dropoff_datetime <fct> 6/16/19 0:28, 6/28/19 0:16, 6/14/19 23:22,…
## $ passenger_count       <int> 1, 1, 2, 1, 1, 6, 1, 2, 1, 1, 1, 1, 1, 1, …
## $ trip_distance         <dbl> 1.60, 1.12, 2.72, 2.90, 0.62, 1.90, 0.96, …
## $ RatecodeID            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ store_and_fwd_flag    <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, …
## $ payment_type          <int> 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, …
## $ fare_amount           <dbl> 10.0, 6.5, 13.5, 11.0, 5.0, 11.0, 7.0, 7.0…
## $ extra                 <dbl> 0.5, 0.5, 0.5, 3.5, 0.5, 0.5, 0.0, 3.0, 0.…
## $ mta_tax               <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.…
## $ tip_amount            <dbl> 0.00, 2.06, 2.60, 0.00, 1.76, 0.00, 1.00, …
## $ tolls_amount          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ improvement_surcharge <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.…
## $ total_amount          <dbl> 13.80, 12.36, 19.90, 15.30, 10.56, 14.80, …
## $ congestion_surcharge  <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.…
## Observations: 265
## Variables: 4
## $ LocationID   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ Borough      <fct> EWR, Queens, Bronx, Manhattan, Staten Island, State…
## $ Zone         <fct> Newark Airport, Jamaica Bay, Allerton/Pelham Garden…
## $ service_zone <fct> EWR, Boro Zone, Boro Zone, Yellow Zone, Boro Zone, …

The dataset provides a location ID for each pickup and drop-off, corresponding to a taxi zone in one of the five boroughs. These nominal variables provide little value in their integer format, since the IDs alone do not tell us where each zone is. We downloaded a taxi zone lookup dataset that gives the borough for each location ID and also names the specific zone (neighborhood) within each borough. We merged that dataset with the taxi dataset to identify the borough for both pickup and drop-off.
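The merge described above can be sketched as follows. This is a minimal illustration, not the original code: the data frame names `trips` and `zones` and the `trip_id` column are hypothetical, and only the first four lookup rows (as shown in the zone table output above) are used.

```r
# Toy stand-ins for the trip data and the zone lookup table.
trips <- data.frame(
  trip_id      = 1:3,          # hypothetical row identifier
  PULocationID = c(4, 2, 3),
  DOLocationID = c(1, 4, 4)
)
zones <- data.frame(
  LocationID = 1:4,
  Borough    = c("EWR", "Queens", "Bronx", "Manhattan")
)

# Join once on the pickup ID, then rename the new column.
merged <- merge(trips, zones, by.x = "PULocationID", by.y = "LocationID", all.x = TRUE)
names(merged)[names(merged) == "Borough"] <- "Borough_pu"

# Join again on the drop-off ID and rename again.
merged <- merge(merged, zones, by.x = "DOLocationID", by.y = "LocationID", all.x = TRUE)
names(merged)[names(merged) == "Borough"] <- "Borough_do"

# merge() reorders rows by the join key, so restore the original order.
merged <- merged[order(merged$trip_id), ]
print(merged)
```

Using `all.x = TRUE` keeps trips whose location ID has no match in the lookup table; their borough would then be `NA` rather than the row being dropped.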

## [1] EWR           Bronx         Manhattan     Queens        Brooklyn     
## [6] Staten Island Unknown      
## Levels: Bronx Brooklyn EWR Manhattan Queens Staten Island Unknown

2.2 Data statistics

Looking at the distribution of raw tip amount, it is clear that it is not normally distributed.

## Error in boxplot(processed_df$tip_amount): object 'processed_df' not found
## Error in qqnorm(processed_df$tip_amount): object 'processed_df' not found
## Error in quantile(y, probs, names = FALSE, type = qtype, na.rm = TRUE): object 'processed_df' not found

A normal distribution is an assumption of many statistical analyses. Generally, raw tip amounts vary because fare amounts vary. One quantity that may vary less is the tipping percentage: in the US, customers often tip a customary percentage (for example, 15% at restaurants). We divided the tip amount by the fare amount to obtain a tipping percentage, stored as the fraction per_tip:

per_tip creation
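A minimal sketch of how `per_tip` could be computed, using the first few fare and tip values from the data preview plus one invented zero-fare row. Setting zero-fare rows to `NA` is an assumption made here for illustration; the original handling is not shown.

```r
trips <- data.frame(
  fare_amount = c(10.0, 6.5, 13.5, 0),    # last row is an invented zero-fare trip
  tip_amount  = c(0.00, 2.06, 2.60, 1.00)
)

# Tip as a fraction of the fare. A zero fare would give NaN or Inf,
# so those rows are set to NA instead (assumed handling).
trips$per_tip <- ifelse(trips$fare_amount > 0,
                        trips$tip_amount / trips$fare_amount,
                        NA_real_)

print(round(trips$per_tip, 4))
```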

##   DOLocationID    PULocationID         X              VendorID    
##  Min.   :  1.0   Min.   :  3.0   Min.   :   1016   Min.   :1.000  
##  1st Qu.:107.0   1st Qu.:114.0   1st Qu.:1725764   1st Qu.:1.000  
##  Median :162.0   Median :161.0   Median :3459818   Median :2.000  
##  Mean   :160.4   Mean   :161.9   Mean   :3461833   Mean   :1.642  
##  3rd Qu.:233.0   3rd Qu.:233.0   3rd Qu.:5200940   3rd Qu.:2.000  
##  Max.   :265.0   Max.   :265.0   Max.   :6940096   Max.   :4.000  
##                                                                   
##     tpep_pickup_datetime   tpep_dropoff_datetime passenger_count
##  6/11/19 7:56 :    7     6/24/19 18:36:    6     Min.   :0.000  
##  6/14/19 13:28:    6     6/27/19 21:45:    6     1st Qu.:1.000  
##  6/3/19 15:08 :    6     6/29/19 0:26 :    6     Median :1.000  
##  6/11/19 13:53:    5     6/1/19 22:46 :    5     Mean   :1.565  
##  6/14/19 23:16:    5     6/10/19 11:17:    5     3rd Qu.:2.000  
##  6/20/19 9:26 :    5     6/11/19 19:25:    5     Max.   :6.000  
##  (Other)      :19966     (Other)      :19967                    
##  trip_distance      RatecodeID    store_and_fwd_flag  payment_type  
##  Min.   : 0.000   Min.   :1.000   N:19893            Min.   :1.000  
##  1st Qu.: 0.990   1st Qu.:1.000   Y:  107            1st Qu.:1.000  
##  Median : 1.645   Median :1.000                      Median :1.000  
##  Mean   : 3.037   Mean   :1.054                      Mean   :1.291  
##  3rd Qu.: 3.100   3rd Qu.:1.000                      3rd Qu.:2.000  
##  Max.   :51.200   Max.   :5.000                      Max.   :4.000  
##                                                                     
##   fare_amount          extra           mta_tax          tip_amount     
##  Min.   :-160.00   Min.   :-1.000   Min.   :-0.5000   Min.   :  0.000  
##  1st Qu.:   6.50   1st Qu.: 0.000   1st Qu.: 0.5000   1st Qu.:  0.000  
##  Median :   9.50   Median : 0.500   Median : 0.5000   Median :  1.960  
##  Mean   :  13.47   Mean   : 1.163   Mean   : 0.4949   Mean   :  2.277  
##  3rd Qu.:  15.00   3rd Qu.: 2.500   3rd Qu.: 0.5000   3rd Qu.:  3.000  
##  Max.   : 399.20   Max.   : 7.000   Max.   : 0.5000   Max.   :175.000  
##                                                                        
##   tolls_amount     improvement_surcharge  total_amount    
##  Min.   :-6.1200   Min.   :-0.3000       Min.   :-160.80  
##  1st Qu.: 0.0000   1st Qu.: 0.3000       1st Qu.:  11.30  
##  Median : 0.0000   Median : 0.3000       Median :  14.80  
##  Mean   : 0.4059   Mean   : 0.2985       Mean   :  19.56  
##  3rd Qu.: 0.0000   3rd Qu.: 0.3000       3rd Qu.:  21.20  
##  Max.   :43.4300   Max.   : 0.3000       Max.   : 400.00  
##                                                           
##  congestion_surcharge         Borough_pu            Borough_do   
##  Min.   :-2.500       Bronx        :   35   Bronx        :  159  
##  1st Qu.: 2.500       Brooklyn     :  242   Brooklyn     :  810  
##  Median : 2.500       EWR          :    0   EWR          :   40  
##  Mean   : 2.273       Manhattan    :18088   Manhattan    :17661  
##  3rd Qu.: 2.500       Queens       : 1466   Queens       : 1073  
##  Max.   : 2.750       Staten Island:    1   Staten Island:    3  
##                       Unknown      :  168   Unknown      :  254  
##                         Zone            service_zone      per_tip       
##  Midtown Center           :  791   Airports   :  459   Min.   : 0.0000  
##  Upper East Side North    :  765   Boro Zone  : 2639   1st Qu.: 0.0000  
##  Upper East Side South    :  758   EWR        :   40   Median : 0.2267  
##  Murray Hill              :  619   N/A        :  254   Mean   : 0.1839  
##  Times Sq/Theatre District:  614   Yellow Zone:16608   3rd Qu.: 0.2878  
##  (Other)                  :16380                       Max.   :11.4400  
##  NA's                     :   73                       NA's   :15
## Error in is.data.frame(x): object 'processed_df' not found
## Error in mean(processed_df$per_tip): object 'processed_df' not found

2.3 Outlier detection and data filters

The dataset comes with an associated data definition guide.

## 'data.frame':    12965 obs. of  24 variables:
##  $ DOLocationID         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PULocationID         : int  233 231 186 234 231 161 246 68 50 132 ...
##  $ X                    : int  14365 1376 8878 13985 9970 7812 3427 13195 9857 7422 ...
##  $ VendorID             : int  2 1 1 1 2 1 2 2 2 1 ...
##  $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 13958 4851 10536 8018 11326 1673 8964 12138 12514 12942 ...
##  $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 13903 4833 10443 7946 11238 1655 8895 12052 12466 12912 ...
##  $ passenger_count      : int  1 1 1 1 1 1 4 1 1 1 ...
##  $ trip_distance        : num  23.9 15.5 16.7 14.2 13.5 ...
##  $ RatecodeID           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ store_and_fwd_flag   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ payment_type         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fare_amount          : num  96 73.5 71 64 54.5 68 61 61.5 70.5 117 ...
##  $ extra                : num  0 1 0 0 0 0 1 0.5 1 0 ...
##  $ mta_tax              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tip_amount           : num  29.2 18.4 12 19.1 13.1 ...
##  $ tolls_amount         : num  20.5 17.5 10.5 12.5 10.5 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  146 110.8 93.8 96 78.4 ...
##  $ congestion_surcharge : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Borough_pu           : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 5 ...
##  $ Borough_do           : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Zone                 : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
##  $ service_zone         : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ per_tip              : num  0.304 0.251 0.169 0.299 0.24 ...

Pickup column

2.4 Dropoff column

Trip duration

## 'data.frame':    12220 obs. of  36 variables:
##  $ DOLocationID         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PULocationID         : int  231 161 68 143 68 125 164 87 230 100 ...
##  $ X                    : int  9970 7812 13195 12280 2391 15423 1082 1862 9274 11863 ...
##  $ VendorID             : int  2 1 2 2 1 1 2 2 2 2 ...
##  $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 11326 1673 12138 9232 7771 14662 6282 3135 6642 14376 ...
##  $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 11238 1655 12052 9152 7686 14570 6243 3092 6579 14275 ...
##  $ passenger_count      : int  1 1 1 1 1 1 1 1 5 2 ...
##  $ trip_distance        : num  13.5 17.3 16.4 17.8 15.3 ...
##  $ RatecodeID           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ store_and_fwd_flag   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ payment_type         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fare_amount          : num  54.5 68 61.5 65.5 60 57.5 58 69.5 64 65.5 ...
##  $ extra                : num  0 0 0.5 0 0 0 0 0.5 0.5 0 ...
##  $ mta_tax              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tip_amount           : num  13.1 15.8 10 16.7 8 ...
##  $ tolls_amount         : num  10.5 10.5 12.5 17.5 10.5 23 10.5 23 10.5 17.5 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  78.4 94.5 84.8 100 78.8 ...
##  $ congestion_surcharge : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Borough_pu           : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Borough_do           : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Zone                 : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
##  $ service_zone         : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ per_tip              : num  0.24 0.232 0.163 0.254 0.133 ...
##  $ pickup_datetime      : POSIXct, format: "0019-06-29 07:40:00" "0019-06-12 12:12:00" ...
##  $ pickup_time          : chr  "07:40" "12:12" "20:09" "07:11" ...
##  $ pickup_hrs           : chr  "07" "12" "20" "07" ...
##  $ pickup_hrs_num       : num  7 12 20 7 6 16 12 4 3 6 ...
##  $ pickup_time_type     : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
##  $ dropoff_datetime     : POSIXlt, format: "0019-06-29 08:01:00" "0019-06-12 12:47:00" ...
##  $ dropoff_time         : chr  "08:01" "12:47" "20:36" "07:45" ...
##  $ dropoff_hrs          : chr  "08" "12" "20" "07" ...
##  $ dropoff_hrs_num      : num  8 12 20 7 7 16 12 4 3 7 ...
##  $ dropoff_time_type    : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
##  $ trip_duration_mins   : 'difftime' num  21 35 27 34 ...
##   ..- attr(*, "units")= chr "mins"
##  $ trip_duration_mins1  : num  21 35 27 34 26 34 31 31 24 31 ...

2.6 EDA

EDA questions:
  1. Which places give the highest and lowest tips?
  2. How does the tip amount vary?
  3. Which time of day yields higher tips?
  4. How are tips distributed across passenger counts?
  5. How do trip distance and duration vary?

## # A tibble: 3 x 5
##   VendorID     n average    min   max
##      <int> <int>   <dbl>  <dbl> <dbl>
## 1        1  4496   0.261 0.0952 0.435
## 2        2  7678   0.264 0.0952 0.434
## 3        4    46   0.269 0.108  0.416

## Error in hist(check_2$per_tip, probability = T, main = "Histogram of normal\ndata", : object 'check_2' not found
## Error in density(check_2$per_tip): object 'check_2' not found

## [1] 0.07113667
## [1] 0.2630256

##                    Df Sum Sq  Mean Sq F value Pr(>F)  
## passenger_count     6   0.06 0.010469    2.07 0.0533 .
## Residuals       12213  61.77 0.005058                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


3 MADHURI

## 
##  Anderson-Darling normality test
## 
## data:  processed_df$per_tip
## A = 51.601, p-value < 2.2e-16

##                     Df Sum Sq Mean Sq F value   Pr(>F)    
## pickup_time_type     3   0.20 0.06797   13.47 8.97e-09 ***
## Residuals        12216  61.63 0.00504                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = per_tip ~ pickup_time_type, data = processed_df)
## 
## $pickup_time_type
##                           diff           lwr          upr     p adj
## Afternoon-Morning  0.002581970 -0.0020717179 0.0072356582 0.4833023
## Evening-Morning    0.010398583  0.0059664376 0.0148307280 0.0000000
## Night-Morning      0.005867182  0.0009774036 0.0107569603 0.0110525
## Evening-Afternoon  0.007816613  0.0032848786 0.0123483466 0.0000557
## Night-Afternoon    0.003285212 -0.0016950125 0.0082654361 0.3263547
## Night-Evening     -0.004531401 -0.0093052601 0.0002424584 0.0700097

Since p < 0.05, we reject the null hypothesis that the mean tipping percentage is the same for all pickup times of day. Conclusion: not all of the means are equal, and the result is statistically significant. In the Tukey comparisons, Evening differs significantly from both Morning and Afternoon, and Night differs significantly from Morning.
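For reference, a one-way ANOVA followed by Tukey's HSD can be run in base R as sketched below. The data are synthetic and the group means are invented for illustration; only the column names `per_tip` and `pickup_time_type` come from the output above.

```r
set.seed(1)
time_levels <- c("Morning", "Afternoon", "Evening", "Night")

# Synthetic tipping fractions for four time-of-day groups (invented means).
toy <- data.frame(
  pickup_time_type = factor(rep(time_levels, each = 50), levels = time_levels),
  per_tip = c(rnorm(50, 0.25, 0.05), rnorm(50, 0.25, 0.05),
              rnorm(50, 0.30, 0.05), rnorm(50, 0.26, 0.05))
)

# One-way ANOVA: is the mean of per_tip the same in every group?
fit <- aov(per_tip ~ pickup_time_type, data = toy)
print(summary(fit))

# Tukey's HSD: which pairs of groups differ, with family-wise adjusted p-values.
tukey <- TukeyHSD(fit)
print(tukey)
```

With four groups, the Tukey table has six rows, one per pair, which matches the shape of the output shown above.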

##                      Df Sum Sq Mean Sq F value   Pr(>F)    
## dropoff_time_type     3   0.22 0.07189   14.25 2.87e-09 ***
## Residuals         12216  61.62 0.00504                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = per_tip ~ dropoff_time_type, data = processed_df)
## 
## $dropoff_time_type
##                           diff           lwr          upr     p adj
## Afternoon-Morning  0.002019670 -0.0026711257  0.006710466 0.6856715
## Evening-Morning    0.010641669  0.0061625724  0.015120765 0.0000000
## Night-Morning      0.004394981 -0.0004606098  0.009250572 0.0922912
## Evening-Afternoon  0.008621999  0.0040836819  0.013160315 0.0000064
## Night-Afternoon    0.002375311 -0.0025349617  0.007285584 0.5994586
## Night-Evening     -0.006246688 -0.0109551391 -0.001538236 0.0036582

Since p < 0.05, we reject the null hypothesis that the mean tipping percentage is the same for all drop-off times of day. Conclusion: not all of the means are equal, and the result is statistically significant. In the Tukey comparisons, Evening differs significantly from Morning, Afternoon, and Night.

## Warning in if (freq) x$counts else x$density: the condition has length > 1
## and only the first element will be used
## Warning in if (!freq) "Density" else "Frequency": the condition has length
## > 1 and only the first element will be used

## 
##  Welch Two Sample t-test
## 
## data:  processed_df$per_tip and processed_df$trip_duration_mins1
## t = -179.56, df = 12221, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.88660 -12.60829
## sample estimates:
##  mean of x  mean of y 
##  0.2630256 13.0104746
## Warning in cor(processed_df_num): the standard deviation is zero

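Since p < 0.05, we reject the null hypothesis that the means of per_tip and trip_duration_mins1 are equal: the true difference in means is not equal to 0, and the result is statistically significant. Note, however, that per_tip is a proportion and trip_duration_mins1 is measured in minutes, so a difference between their means does not by itself show whether trip duration influences tipping percentage.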
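If the question is whether tipping percentage moves with trip duration, one alternative to comparing the two column means (not used in the original analysis) is a correlation test. A minimal sketch with base R's `cor.test` on synthetic data; the values are invented and generated independently, so no particular result is implied for the taxi data.

```r
set.seed(2)

# Synthetic data: durations in minutes and tipping fractions, generated independently.
trip_duration_mins <- runif(200, min = 2, max = 60)
per_tip <- 0.26 + rnorm(200, sd = 0.05)

# Pearson correlation test: H0 is that the true correlation is zero.
res <- cor.test(per_tip, trip_duration_mins, method = "pearson")
print(res$estimate)  # sample correlation coefficient
print(res$p.value)   # p-value for H0: correlation = 0
```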


3.0.1 Tip paid vs distance travelled by passengers

Variables…

Which test to perform and why?

Assumptions/Criteria to be fulfilled:
  • Random sampling data
  • Normally distributed dependent variable (CLT)
  • Independence of observations
First Step in Significance testing:
  • Null hypothesis (H0): the average tip amount is the same for short-distance and long-distance passengers
  • Alternative hypothesis (Ha): the average tip amount is not the same for short-distance and long-distance passengers
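The hypotheses above can be tested with a two-sample t-test once trips are split into two distance groups. The sketch below uses synthetic data and a hypothetical median cut-off; the label "Shorter Distances" is also assumed (only "Longer Distances" appears in the output). The error "not enough 'x' observations" reported further down is raised when a group has fewer than two observations, so one possible cause is a subset that matched no rows; the guard in the sketch checks for that.

```r
set.seed(3)

# Synthetic trips: 100 short and 100 long, with tips that grow with distance (invented).
toy <- data.frame(trip_distance = c(runif(100, 0.3, 2), runif(100, 2.5, 20)))
toy$tip_amount <- 1 + 0.5 * toy$trip_distance + rnorm(200, sd = 0.5)

# Hypothetical cut-off: split at the median so both groups are non-empty.
cutoff <- median(toy$trip_distance)
toy$distance_type <- factor(ifelse(toy$trip_distance <= cutoff,
                                   "Shorter Distances", "Longer Distances"))

short_dist <- toy[toy$distance_type == "Shorter Distances", ]
long_dist  <- toy[toy$distance_type == "Longer Distances", ]

# Guard: t.test() needs at least two observations in each group.
stopifnot(nrow(short_dist) > 1, nrow(long_dist) > 1)

# Welch two-sample t-test (the default, which does not assume equal variances).
result <- t.test(short_dist$tip_amount, long_dist$tip_amount)
print(result)
```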
## Observations: 12,220
## Variables: 2
## $ tip_amount    <dbl> 13.06, 15.75, 10.00, 16.66, 8.00, 12.00, 17.20, 14…
## $ trip_distance <dbl> 13.46, 17.30, 16.41, 17.84, 15.30, 13.30, 14.27, 1…
## Observations: 12,220
## Variables: 2
## $ tip_amount    <dbl> 13.06, 15.75, 10.00, 16.66, 8.00, 12.00, 17.20, 14…
## $ trip_distance <fct> Longer Distances, Longer Distances, Longer Distanc…

## 'data.frame':    12220 obs. of  36 variables:
##  $ DOLocationID         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PULocationID         : int  231 161 68 143 68 125 164 87 230 100 ...
##  $ X                    : int  9970 7812 13195 12280 2391 15423 1082 1862 9274 11863 ...
##  $ VendorID             : int  2 1 2 2 1 1 2 2 2 2 ...
##  $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 11326 1673 12138 9232 7771 14662 6282 3135 6642 14376 ...
##  $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 11238 1655 12052 9152 7686 14570 6243 3092 6579 14275 ...
##  $ passenger_count      : Factor w/ 7 levels "0","1","2","3",..: 2 2 2 2 2 2 2 2 6 3 ...
##  $ trip_distance        : num  13.5 17.3 16.4 17.8 15.3 ...
##  $ RatecodeID           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ store_and_fwd_flag   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ payment_type         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fare_amount          : num  54.5 68 61.5 65.5 60 57.5 58 69.5 64 65.5 ...
##  $ extra                : num  0 0 0.5 0 0 0 0 0.5 0.5 0 ...
##  $ mta_tax              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tip_amount           : num  13.1 15.8 10 16.7 8 ...
##  $ tolls_amount         : num  10.5 10.5 12.5 17.5 10.5 23 10.5 23 10.5 17.5 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  78.4 94.5 84.8 100 78.8 ...
##  $ congestion_surcharge : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Borough_pu           : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Borough_do           : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Zone                 : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
##  $ service_zone         : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ per_tip              : num  0.24 0.232 0.163 0.254 0.133 ...
##  $ pickup_datetime      : POSIXct, format: "0019-06-29 07:40:00" "0019-06-12 12:12:00" ...
##  $ pickup_time          : chr  "07:40" "12:12" "20:09" "07:11" ...
##  $ pickup_hrs           : chr  "07" "12" "20" "07" ...
##  $ pickup_hrs_num       : num  7 12 20 7 6 16 12 4 3 6 ...
##  $ pickup_time_type     : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
##  $ dropoff_datetime     : POSIXlt, format: "0019-06-29 08:01:00" "0019-06-12 12:47:00" ...
##  $ dropoff_time         : chr  "08:01" "12:47" "20:36" "07:45" ...
##  $ dropoff_hrs          : chr  "08" "12" "20" "07" ...
##  $ dropoff_hrs_num      : num  8 12 20 7 7 16 12 4 3 7 ...
##  $ dropoff_time_type    : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
##  $ trip_duration_mins   : 'difftime' num  21 35 27 34 ...
##   ..- attr(*, "units")= chr "mins"
##  $ trip_duration_mins1  : num  21 35 27 34 26 34 31 31 24 31 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09524 0.22615 0.26609 0.26303 0.30857 0.43478
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.010   1.600   2.456   2.700  25.100

## Error in t.test.default(short_dist$tip_amount, long_dist$tip_amount): not enough 'x' observations
## Error in print(result): object 'result' not found


3.0.2 GEOGRAPHIC LOCATION AND TIP

As previously discussed, location can also affect tipping, because each location has its own customs, traditions, and socioeconomic conditions [1]. In this context, a person’s wealth may also affect how much he or she tips the taxi driver. While most of that prior analysis was done on a larger, regional scale, each borough in New York has a distinct culture, so there may be microcosms of tipping culture within the city that differ from borough to borough. The dataset does not provide the passenger’s home location, wealth, or purpose of travel; it provides only pickup and drop-off locations. However, these locations may give some context about a passenger’s background for two main reasons. First, passengers may use taxis to travel from home to work and from work to home [2]. Second, each individual has their own time-space geography. Some geographers have argued that people’s time-spaces are often segregated, whether by gender or race, which can affect their access to resources. Pickup and drop-off locations may therefore provide some context about how an individual lives.

[1] https://scholarship.sha.cornell.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1102&context=articles
[2] https://ieeexplore.ieee.org/abstract/document/4624004; https://onlinelibrary.wiley.com/doi/pdf/10.1111/0033-0124.00158?casa_token=fgUlwmzlofwAAAAA:k19TGhhaKN9uLpNH1QXQgXrqRrm77fEgGmBisvuspaVDzvnNfXkrcvBvC4QsJMOmptyw8ew-qJWCd4s; https://biblio.ugent.be/publication/3029997/file/6779790.pdf; https://www.tandfonline.com/doi/full/10.1080/02723638.2016.1142152

To better understand this issue, we ran descriptive statistics. First, we calculated the number of trips in each borough, grouped first by pickup location and then by drop-off location.

These bar charts show that Manhattan has the highest number of both pickups and drop-offs, followed by Queens and Brooklyn in second and third place, respectively. We also looked at the frequency of each pickup and drop-off combination.
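Counts like these can be produced with base R's `table`. A minimal sketch on a handful of invented trips; the column names `Borough_pu` and `Borough_do` follow the data, but the rows are made up for illustration.

```r
# Seven invented trips with pickup and drop-off boroughs.
trips <- data.frame(
  Borough_pu = c("Manhattan", "Manhattan", "Queens", "Manhattan",
                 "Brooklyn", "Queens", "Manhattan"),
  Borough_do = c("Manhattan", "Queens", "Manhattan", "Manhattan",
                 "Manhattan", "Manhattan", "Manhattan")
)

# Trips per borough, by pickup and by drop-off, largest first.
pu_counts <- sort(table(trips$Borough_pu), decreasing = TRUE)
do_counts <- sort(table(trips$Borough_do), decreasing = TRUE)
print(pu_counts)
print(do_counts)

# Frequency of each pickup/drop-off combination, largest first.
pairs <- as.data.frame(table(pickup = trips$Borough_pu, dropoff = trips$Borough_do))
pairs <- pairs[pairs$Freq > 0, ]
pairs <- pairs[order(-pairs$Freq), ]
print(pairs)

# barplot(pu_counts) would draw the corresponding bar chart.
```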

## Error in attributes(out) <- attributes(col): 'names' attribute [11] must be the same length as the vector [1]

This table shows that Manhattan to Manhattan has the highest number of trips, followed by Queens to Manhattan, Manhattan to Queens, and Manhattan to Brooklyn.

Given that Manhattan and Queens have the highest numbers of pickups and drop-offs, it makes sense that the numbers of trips within and between these boroughs are also the highest. It is also reasonable that Manhattan ranks highest on both measures, because yellow taxis (the focus of this study) mainly serve Manhattan, whereas green taxis usually serve the other boroughs, which have traditionally been underserved by taxis (SOURCE).

With these descriptive statistics in mind, we decided to compare the mean tipping percentage for each borough based on drop off and pick up location to see if they were statistically different.

We used the ANOVA test. Here are the results:

## Call:
##    aov(formula = per_tip ~ Borough_pu, data = processed_df)
## 
## Terms:
##                 Borough_pu Residuals
## Sum of Squares     0.50296  61.33039
## Deg. of Freedom          4     12215
## 
## Residual standard error: 0.07085836
## Estimated effects may be unbalanced
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Borough_pu      4   0.50 0.12574   25.04 <2e-16 ***
## Residuals   12215  61.33 0.00502                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
##    aov(formula = per_tip ~ Borough_do, data = processed_df)
## 
## Terms:
##                 Borough_do Residuals
## Sum of Squares     0.74527  61.08808
## Deg. of Freedom          6     12213
## 
## Residual standard error: 0.07072404
## Estimated effects may be unbalanced
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Borough_do      6   0.75  0.1242   24.83 <2e-16 ***
## Residuals   12213  61.09  0.0050                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values are 1.099188 × 10^{-20} and 1.9320892 × 10^{-29} for pick up and drop off, respectively. Both of these are smaller than a significance level of 0.05 (a 0.95 confidence level). Thus, we can reject the null hypothesis that the means are the same and say the means are statistically different at a significance level of 0.05.
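The printed ANOVA table rounds very small p-values to `<2e-16`, so exact values like those quoted above have to be pulled out of the summary object. A sketch of one way to do that, on synthetic data with invented group means:

```r
set.seed(4)

# Synthetic tipping fractions for three boroughs (invented means).
toy <- data.frame(
  Borough_pu = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 40)),
  per_tip    = c(rnorm(40, 0.23, 0.05), rnorm(40, 0.27, 0.05), rnorm(40, 0.24, 0.05))
)

fit <- aov(per_tip ~ Borough_pu, data = toy)

# summary(fit)[[1]] is the ANOVA table; the first row is the Borough_pu term.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

alpha <- 0.05
cat("p-value:", format(p_value, scientific = TRUE, digits = 7), "\n")
cat("reject H0 at alpha = 0.05:", p_value < alpha, "\n")
```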

Because both results are significant, the next step is Tukey’s HSD test, which compares each pair of boroughs to see whether their means differ significantly. Here are the results from that analysis:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = per_tip ~ Borough_pu, data = processed_df)
## 
## $Borough_pu
##                            diff          lwr         upr     p adj
## Brooklyn-Bronx     -0.147655747 -0.341855003  0.04654351 0.2312841
## Manhattan-Bronx    -0.112689738 -0.306012622  0.08063315 0.5036324
## Queens-Bronx       -0.141636653 -0.335165845  0.05189254 0.2675855
## Unknown-Bronx      -0.119411173 -0.313626929  0.07480458 0.4480395
## Manhattan-Brooklyn  0.034966008  0.016362693  0.05356932 0.0000030
## Queens-Brooklyn     0.006019094 -0.014618111  0.02665630 0.9319400
## Unknown-Brooklyn    0.028244573  0.001936672  0.05455247 0.0281303
## Queens-Manhattan   -0.028946915 -0.038235632 -0.01965820 0.0000000
## Unknown-Manhattan  -0.006721435 -0.025496198  0.01205333 0.8658243
## Unknown-Queens      0.022225480  0.001433592  0.04301737 0.0292158
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = per_tip ~ Borough_do, data = processed_df)
## 
## $Borough_do
##                                 diff          lwr          upr     p adj
## Brooklyn-Bronx           0.007440429 -0.029833482  0.044714340 0.9971509
## EWR-Bronx                0.005236027 -0.064790592  0.075262646 0.9999905
## Manhattan-Bronx          0.041394710  0.005574339  0.077215081 0.0117042
## Queens-Bronx             0.017922932 -0.019339844  0.055185707 0.7921154
## Staten Island-Bronx      0.055393986 -0.156202712  0.266990684 0.9875895
## Unknown-Bronx            0.020288045 -0.019916376  0.060492466 0.7521602
## EWR-Brooklyn            -0.002204402 -0.063315819  0.058907016 0.9999999
## Manhattan-Brooklyn       0.033954281  0.023278280  0.044630282 0.0000000
## Queens-Brooklyn          0.010482503 -0.004329399  0.025294405 0.3604522
## Staten Island-Brooklyn   0.047953557 -0.160862248  0.256769362 0.9938422
## Unknown-Brooklyn         0.012847617 -0.008301224  0.033996457 0.5539288
## Manhattan-EWR            0.036158683 -0.024077186  0.096394552 0.5684189
## Queens-EWR               0.012686905 -0.048417721  0.073791531 0.9964559
## Staten Island-EWR        0.050157959 -0.166909826  0.267225744 0.9936320
## Unknown-EWR              0.015052018 -0.047889672  0.077993708 0.9923343
## Queens-Manhattan        -0.023471778 -0.034108837 -0.012834720 0.0000000
## Staten Island-Manhattan  0.013999276 -0.194561974  0.222560526 0.9999950
## Unknown-Manhattan       -0.021106665 -0.039573609 -0.002639721 0.0133007
## Staten Island-Queens     0.037471054 -0.171342764  0.246284872 0.9984320
## Unknown-Queens           0.002365113 -0.018764095  0.023494322 0.9998970
## Unknown-Staten Island   -0.035105941 -0.244464703  0.174252822 0.9989324

The overall ANOVA only shows that the borough means are not all equal; Tukey’s HSD identifies which pairs differ. Excluding Unknown, the pick up pairs with significantly different tipping percentages (adjusted p-value below 0.05) are Manhattan and Brooklyn, and Queens and Manhattan. For drop off pairs, Manhattan and Bronx, Manhattan and Brooklyn, and Manhattan and Queens are significant.
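
Reading significant pairs off a long Tukey table by eye is error-prone, so the selection can also be done in code. The following is a minimal sketch on made-up data: `toy_df` and its values are hypothetical stand-ins for `processed_df`, and only the column names `per_tip` and `Borough_pu` come from the analysis above.

```r
# Toy data standing in for processed_df (hypothetical values, not the taxi sample).
set.seed(1)
toy_df <- data.frame(
  Borough_pu = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 200)),
  per_tip    = c(rnorm(200, 0.20, 0.07), rnorm(200, 0.24, 0.07), rnorm(200, 0.20, 0.07))
)

fit   <- aov(per_tip ~ Borough_pu, data = toy_df)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# TukeyHSD() returns one matrix per factor with columns diff, lwr, upr and "p adj".
pairs_tbl <- as.data.frame(tukey$Borough_pu)
pairs_tbl$pair <- rownames(pairs_tbl)

# Keep pairs that are significant at 0.05 and do not involve "Unknown".
is_sig     <- pairs_tbl[["p adj"]] < 0.05
is_unknown <- grepl("Unknown", pairs_tbl$pair)
sig_pairs  <- pairs_tbl[is_sig & !is_unknown, ]

print(sig_pairs[, c("pair", "diff", "p adj")])
```

The same filter could be applied to `TukeyHSD()` output for `Borough_do` by swapping the factor name.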

ANOVA assumes normally distributed residuals, and as previously highlighted, the tipping amount is not necessarily normally distributed. Nonetheless, a look at the raw tipping amount provides useful context. We compared the mean tipping amount for each borough to see whether the means are statistically different. Here are the results for raw tip amount based on location:

## Call:
##    aov(formula = tip_amount ~ Borough_pu, data = processed_df)
## 
## Terms:
##                 Borough_pu Residuals
## Sum of Squares     9594.53  37995.94
## Deg. of Freedom          4     12215
## 
## Residual standard error: 1.763689
## Estimated effects may be unbalanced
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Borough_pu      4   9595  2398.6   771.1 <2e-16 ***
## Residuals   12215  37996     3.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
##    aov(formula = tip_amount ~ Borough_do, data = processed_df)
## 
## Terms:
##                 Borough_do Residuals
## Sum of Squares     8972.01  38618.47
## Deg. of Freedom          6     12213
## 
## Residual standard error: 1.778223
## Estimated effects may be unbalanced
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Borough_do      6   8972  1495.3   472.9 <2e-16 ***
## Residuals   12213  38618     3.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For these ANOVA analyses, the null hypothesis is that the means for all locations are the same. The p-values for pick up and drop off are both reported as below 2e-16, far smaller than a significance level of 0.05 (a 95% confidence level). Thus, we can reject the null hypothesis and conclude that the mean tip amounts are not all equal across locations.
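
The decision rule described here can be sketched in a few lines. This is an illustration on made-up data, not the code that produced the output above: `toy_df` and its values are hypothetical, and `alpha` is simply the 0.05 level used in the text.

```r
# Toy data standing in for processed_df (hypothetical values).
set.seed(42)
toy_df <- data.frame(
  Borough_pu = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 300)),
  tip_amount = c(rnorm(300, 2.5, 1.8), rnorm(300, 2.2, 1.8), rnorm(300, 6.8, 1.8))
)

fit <- aov(tip_amount ~ Borough_pu, data = toy_df)
tab <- summary(fit)[[1]]        # the ANOVA table as a data frame

f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]   # stored numerically; the printed table shows "<2e-16" for very small values

# Reject the null hypothesis of equal means when p falls below alpha.
alpha <- 0.05
reject_null <- p_value < alpha

cat(sprintf("F = %.1f, p = %.3g, reject H0 at %.2f: %s\n",
            f_value, p_value, alpha, reject_null))
```

Pulling the p-value out of the table this way avoids reporting it as exactly 0 when the printed summary only shows an upper bound.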

As before, because both ANOVAs are significant, we can follow up with Tukey’s HSD test:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tip_amount ~ Borough_pu, data = processed_df)
## 
## $Borough_pu
##                          diff        lwr         upr     p adj
## Brooklyn-Bronx      1.7965138 -3.0371711  6.63019867 0.8490671
## Manhattan-Bronx     1.4192781 -3.3925936  6.23114980 0.9292692
## Queens-Bronx        6.0870444  1.2700377 10.90405122 0.0051451
## Unknown-Bronx       2.7872897 -2.0468058  7.62138529 0.5148065
## Manhattan-Brooklyn -0.3772357 -0.8402784  0.08580714 0.1713150
## Queens-Brooklyn     4.2905307  3.7768637  4.80419765 0.0000000
## Unknown-Brooklyn    0.9907760  0.3359634  1.64558849 0.0003552
## Queens-Manhattan    4.6677663  4.4365670  4.89896563 0.0000000
## Unknown-Manhattan   1.3680116  0.9007014  1.83532179 0.0000000
## Unknown-Queens     -3.2997547 -3.8172718 -2.78223764 0.0000000
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = tip_amount ~ Borough_do, data = processed_df)
## 
## $Borough_do
##                                diff         lwr          upr     p adj
## Brooklyn-Bronx           -0.9486873  -1.8858699  -0.01150461 0.0450159
## EWR-Bronx                 8.6900490   6.9293609  10.45073715 0.0000000
## Manhattan-Bronx          -3.0039359  -3.9045720  -2.10329976 0.0000000
## Queens-Bronx              0.4975155  -0.4393872   1.43441820 0.7042679
## Staten Island-Bronx       5.6108824   0.2906798  10.93108488 0.0309347
## Unknown-Bronx            -0.1438463  -1.1547112   0.86701851 0.9995836
## EWR-Brooklyn              9.6387363   8.1022042  11.17526838 0.0000000
## Manhattan-Brooklyn       -2.0552486  -2.3236767  -1.78682058 0.0000000
## Queens-Brooklyn           1.4462028   1.0737853   1.81862031 0.0000000
## Staten Island-Brooklyn    6.5595696   1.3092874  11.80985181 0.0043210
## Unknown-Brooklyn          0.8048409   0.2730930   1.33658891 0.0001652
## Manhattan-EWR           -11.6939849 -13.2085030 -10.17946684 0.0000000
## Queens-EWR               -8.1925335  -9.7288948  -6.65617216 0.0000000
## Staten Island-EWR        -3.0791667  -8.5369294   2.37859610 0.6404759
## Unknown-EWR              -8.8338953 -10.4164462  -7.25134448 0.0000000
## Queens-Manhattan          3.5014514   3.2340025   3.76890031 0.0000000
## Staten Island-Manhattan   8.6148182   3.3709364  13.85870012 0.0000265
## Unknown-Manhattan         2.8600896   2.3957728   3.32440627 0.0000000
## Staten Island-Queens      5.1133668  -0.1368654  10.36359906 0.0621414
## Unknown-Queens           -0.6413618  -1.1726162  -0.11010748 0.0068348
## Unknown-Staten Island    -5.7547287 -11.0186625  -0.49079484 0.0216086

For pick up locations, the following pairs (excluding Unknown) show no statistically significant difference given their large adjusted p-values: Brooklyn and Bronx, Manhattan and Bronx, and Manhattan and Brooklyn. For drop off locations, there is no significant difference for Queens and Bronx, Staten Island and EWR, and Staten Island and Queens. Staten Island and Bronx (p = 0.031) and Staten Island and Brooklyn (p = 0.004) are significant at the 0.05 level, although their confidence intervals are very wide. Every pair involving a Manhattan drop off is significant.

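Based on this analysis, being dropped off in Manhattan is associated with a significantly different tip than being dropped off elsewhere. Every Manhattan drop off pair is significant for the raw tip amount, and Manhattan differs from the Bronx, Brooklyn, and Queens for tip percentage. Compared with those three boroughs, Manhattan drop offs have a lower mean tip amount but a higher mean tip percentage. For pick ups the pattern is weaker: Manhattan differs from Queens on both measures, from Brooklyn only on tip percentage, and from the Bronx on neither.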


## 
## Call:
## lm(formula = per_tip ~ Borough_pu, data = processed_df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.169215 -0.035061  0.003729  0.044118  0.196045 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.37714    0.07086   5.322 1.04e-07 ***
## Borough_puBrooklyn  -0.14766    0.07118  -2.074   0.0381 *  
## Borough_puManhattan -0.11269    0.07086  -1.590   0.1118    
## Borough_puQueens    -0.14164    0.07094  -1.997   0.0459 *  
## Borough_puUnknown   -0.11941    0.07119  -1.677   0.0935 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07086 on 12215 degrees of freedom
## Multiple R-squared:  0.008134,   Adjusted R-squared:  0.007809 
## F-statistic: 25.04 on 4 and 12215 DF,  p-value: < 2.2e-16
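
The coefficients in this output can be read as differences from the reference borough, which is the Bronx because it is the first factor level: the Manhattan coefficient of -0.11269 matches the Manhattan-Bronx difference in the Tukey table for `per_tip`. The intercept's standard error also equals the residual standard error, which suggests the Bronx pick up group is very small. A minimal sketch on made-up data (the `toy_df` values and group sizes are hypothetical) illustrates how a regression on a single factor relates to group means and to the one-way ANOVA:

```r
# Toy data standing in for processed_df (hypothetical values and group sizes).
set.seed(7)
toy_df <- data.frame(
  Borough_pu = factor(rep(c("Bronx", "Brooklyn", "Manhattan"), times = c(20, 150, 300))),
  per_tip    = c(rnorm(20, 0.30, 0.07), rnorm(150, 0.22, 0.07), rnorm(300, 0.25, 0.07))
)

fit_lm      <- lm(per_tip ~ Borough_pu, data = toy_df)
group_means <- tapply(toy_df$per_tip, toy_df$Borough_pu, mean)

# Intercept = mean of the reference level (first factor level, here "Bronx").
intercept <- unname(coef(fit_lm)["(Intercept)"])

# Each remaining coefficient = that borough's mean minus the reference mean.
manhattan_coef <- unname(coef(fit_lm)["Borough_puManhattan"])
manhattan_diff <- unname(group_means["Manhattan"] - group_means["Bronx"])

# The overall F-statistic of the regression equals the one-way ANOVA F.
f_lm  <- unname(summary(fit_lm)$fstatistic["value"])
f_aov <- summary(aov(per_tip ~ Borough_pu, data = toy_df))[[1]][["F value"]][1]

print(all.equal(manhattan_coef, manhattan_diff))
print(all.equal(f_lm, f_aov))
```

Note that the t-tests in the coefficient table are unadjusted comparisons against the reference level only, whereas Tukey’s HSD adjusts for all pairwise comparisons.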


4 Limitations

Hardware constraints were one limitation. Another is that cash tips are by nature not documented, so the analysis covers only tips paid by credit card. Also, as previously discussed, our data was only roughly normally distributed, which may violate some of the assumptions of the statistical analyses.
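
One way to gauge how much the normality concern matters is to test the ANOVA residuals and compare the result against a rank-based test. The sketch below is an illustration only: the data are made up, and the Kruskal-Wallis test is a substitute technique that was not part of the analysis above.

```r
# Right-skewed toy tips (hypothetical), to mimic a non-normal response.
set.seed(3)
toy_df <- data.frame(
  Borough_do = factor(rep(c("Brooklyn", "Manhattan", "Queens"), each = 250)),
  tip_amount = c(rgamma(250, shape = 2, rate = 0.6),
                 rgamma(250, shape = 2, rate = 0.9),
                 rgamma(250, shape = 2, rate = 0.4))
)

fit <- aov(tip_amount ~ Borough_do, data = toy_df)

# ANOVA's normality assumption concerns the residuals, so test those.
# shapiro.test() accepts between 3 and 5000 values, so a larger data set
# would need to be subsampled first.
shapiro_res <- shapiro.test(residuals(fit))

# Kruskal-Wallis compares the groups by rank and does not assume normality.
kw_res <- kruskal.test(tip_amount ~ Borough_do, data = toy_df)

cat(sprintf("Shapiro-Wilk p = %.3g; Kruskal-Wallis p = %.3g\n",
            shapiro_res$p.value, kw_res$p.value))
```

If the rank-based test agrees with the ANOVA, the conclusion is less likely to hinge on the normality assumption.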

5 Future Considerations

A future analysis could focus on the neighborhoods within Manhattan, since yellow taxis are mainly found there. Alternatively, we could combine the yellow taxi data with green taxi data for an analysis that covers all five boroughs.
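
Combining the two sources would mostly be a matter of aligning column names before stacking the rows. The sketch below uses made-up rows and assumes the green taxi file names its pickup timestamp `lpep_pickup_datetime` where the yellow file uses `tpep_pickup_datetime`; the names `pickup_datetime`, `taxi_type` and `all_trips` are hypothetical.

```r
# Toy stand-ins for the two trip files (hypothetical rows).
yellow <- data.frame(
  tpep_pickup_datetime = c("6/16/19 0:15", "6/28/19 0:09"),
  fare_amount = c(10.0, 6.5),
  tip_amount  = c(2.0, 1.5)
)
green <- data.frame(
  lpep_pickup_datetime = c("6/14/19 23:04"),
  fare_amount = c(13.5),
  tip_amount  = c(3.0)
)

# Give the pickup timestamp one shared name before stacking.
names(yellow)[names(yellow) == "tpep_pickup_datetime"] <- "pickup_datetime"
names(green)[names(green) == "lpep_pickup_datetime"]   <- "pickup_datetime"

# Tag each row with its source so taxi type can enter the analysis as a factor.
yellow$taxi_type <- "yellow"
green$taxi_type  <- "green"

# Stack only the columns both sources share.
shared_cols <- intersect(names(yellow), names(green))
all_trips   <- rbind(yellow[shared_cols], green[shared_cols])
all_trips$taxi_type <- factor(all_trips$taxi_type)

print(all_trips)
```

Keeping `taxi_type` as a column would let later models test whether tipping differs between the two fleets rather than assuming it does not.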

6 Conclusions